Explore Red Wine Dataset

Louis Tian

Key Question

What are the most important physicochemical attributes associated with the perceived quality of red wine?

A Quick Overview of the Dataset

This dataset contains a total of 1599 rows and 12 columns.

Top 5 rows
For readability, the data is transposed.

1 2 3 4 5
fixed.acidity 7.400 7.800 7.800 11.200 7.400
volatile.acidity 0.700 0.880 0.760 0.280 0.700
citric.acid 0.000 0.000 0.040 0.560 0.000
residual.sugar 1.900 2.600 2.300 1.900 1.900
chlorides 0.076 0.098 0.092 0.075 0.076
free.sulfur.dioxide 11.000 25.000 15.000 17.000 11.000
total.sulfur.dioxide 34.000 67.000 54.000 60.000 34.000
density 0.998 0.997 0.997 0.998 0.998
pH 3.510 3.200 3.260 3.160 3.510
sulphates 0.560 0.680 0.650 0.580 0.560
alcohol 9.400 9.800 9.800 9.800 9.400
quality 5.000 5.000 5.000 6.000 5.000

Each row of the dataset represents an observation of a red wine. The columns contain various objective physicochemical attributes of the wine as well as an average quality score.

Basic Statistics

Min. 1st Qu. Median Mean 3rd Qu. Max.
fixed.acidity 4.600 7.100 7.900 8.320 9.200 15.900
volatile.acidity 0.120 0.390 0.520 0.528 0.640 1.580
citric.acid 0.000 0.090 0.260 0.271 0.420 1.000
residual.sugar 0.900 1.900 2.200 2.539 2.600 15.500
chlorides 0.012 0.070 0.079 0.087 0.090 0.611
free.sulfur.dioxide 1.000 7.000 14.000 15.870 21.000 72.000
total.sulfur.dioxide 6.000 22.000 38.000 46.470 62.000 289.000
density 0.990 0.996 0.997 0.997 0.998 1.004
pH 2.740 3.210 3.310 3.311 3.400 4.010
sulphates 0.330 0.550 0.620 0.658 0.730 2.000
alcohol 8.400 9.500 10.200 10.420 11.100 14.900
quality 3.000 5.000 6.000 5.636 6.000 8.000

The table above provides some high-level statistics for each variable in the dataset.

Correlation Matrix

The first step towards understanding the relationship between wine quality and physicochemical attributes is to compute the correlations. However, a large correlation matrix is hard to read and decipher, so I created a visualisation for the correlation matrix.
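As a minimal sketch of what goes into each cell of such a matrix, the snippet below computes Pearson correlations for three columns. It uses only the five rows shown in the "Top 5 rows" table, so the values will not match those of the full 1599-row dataset. The helper names `pearson` and `correlation_matrix` are illustrative and are not taken from the original analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Three columns copied from the five rows shown earlier (a toy sample only).
sample = {
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "volatile.acidity": [0.70, 0.88, 0.76, 0.28, 0.70],
    "quality": [5, 5, 5, 6, 5],
}

def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The matrix is symmetric with ones on the diagonal, which is why a visualisation only needs to draw one triangle of it.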

Visualise Correlation Matrix

This correlation matrix visualisation uses both size and color saturation to represent the magnitude of the correlations. It uses colour hue to represent their direction.

Using this correlation matrix, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with the quality of wine. I also found it interesting, and somewhat surprising, that residual sugar has no correlation with quality.

I wonder how good a linear model based on these two variables will be.

Dependent variable:
quality
alcohol 0.314***
(0.016)
volatile.acidity -1.384***
(0.095)
Constant 3.095***
(0.184)
Observations 1,599
R2 0.317
Adjusted R2 0.316
Residual Std. Error 0.668 (df = 1596)
F Statistic 370.379*** (df = 2; 1596)
Note: *p<0.1; **p<0.05; ***p<0.01

The R^2 is only 0.317, which is definitely not good enough.

A linear regression only fits well if the relationship is actually linear. Given the poor performance of this simple linear model, I will turn my attention to non-linear relationships.
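To make the fitted model concrete, the sketch below plugs the coefficients from the regression table into a prediction function and applies it to the five rows shown at the top of the report. The function name `predict_quality` is illustrative. The coefficients are the rounded values printed in the table, so predictions may differ slightly from those of the fitted model object.

```python
# Coefficients copied (rounded) from the regression table above.
INTERCEPT = 3.095
COEF_ALCOHOL = 0.314
COEF_VOLATILE_ACIDITY = -1.384

def predict_quality(alcohol, volatile_acidity):
    """Predicted quality under the two-variable linear model."""
    return (INTERCEPT
            + COEF_ALCOHOL * alcohol
            + COEF_VOLATILE_ACIDITY * volatile_acidity)

# (alcohol, volatile.acidity, observed quality) for the five rows shown earlier.
rows = [
    (9.4, 0.70, 5),
    (9.8, 0.88, 5),
    (9.8, 0.76, 5),
    (9.8, 0.28, 6),
    (9.4, 0.70, 5),
]

for alcohol, volatile_acidity, observed in rows:
    predicted = predict_quality(alcohol, volatile_acidity)
    residual = observed - predicted
    print(f"observed={observed} predicted={predicted:.2f} residual={residual:+.2f}")
```

Because quality is recorded as an integer while the model outputs a continuous value, some residual is unavoidable even for wines the model places well.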

Next, I will explore each physicochemical attribute individually. Using visualisation, I hope to uncover some non-linear relationships.

Univariate and Bivariate Plots

Wine Quality

Histogram of Wine Quality

Wine Quality Frequency

Quality 1 2 3 4 5 6 7 8
Freq 0 0 10 53 681 638 199 18

The vast majority of wines have a rating of 5 or 6. 199 bottles of wine are rated 7 and 53 are rated 4. Only 18 and 10 bottles are rated 8 and 3 respectively.
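The frequency table can be checked against the earlier summary statistics with a few lines of arithmetic. The sketch below uses only the counts from the table; the variable names are illustrative.

```python
# Counts copied from the "Wine Quality Frequency" table (ratings 1 and 2 are empty).
freq = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(freq.values())
share = {rating: count / total for rating, count in freq.items()}
share_5_or_6 = share[5] + share[6]
mean_quality = sum(rating * count for rating, count in freq.items()) / total

print("total wines:", total)
print("share rated 5 or 6:", round(share_5_or_6, 3))
print("mean quality:", round(mean_quality, 3))
```

The total agrees with the 1599 rows reported in the overview, and the weighted mean agrees with the mean quality of 5.636 in the basic statistics table.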

Fixed Acidity

Histogram for fixed.acidity

Density for fixed.acidity

quality vs fixed.acidity

Fixed acidity has a distribution that is slightly skewed to the right.

Ignoring the wines with the highest rating and judging from the quantiles, one might argue that there is a positive relationship between fixed acidity and quality. However, the variance of fixed acidity is high across all quality ratings.

Volatile Acidity

Histogram for Volatile Acidity (two figures)

quality vs Volatile Acidity (boxplot and jittered data points)

The distribution of volatile acidity is bimodal, with the first mode around 0.4 and the second around 0.6. The density plot shows that the second mode comes largely from wines rated 5.

Both the density plot and the boxplot show a clear negative relationship: high volatile acidity goes with low quality. The same relationship is visible in the histogram, although less obviously than in the boxplot. The variance of volatile acidity also increases as the rating decreases.

The correlation between volatile acidity and quality is -0.3905578.

Citric Acid

Histogram for Citric Acid (three figures)

There are a lot of wines that don’t contain any citric acid at all. Apart from the spike just under 0.5, the distribution appears fairly uniform up to 0.5, where it starts to tail off.

It is quite clear that higher-quality wines tend to have a higher citric acid content.

Residual Sugar

Histogram for Residual Sugar

Distribution of Residual Sugar By Quality Rating

Quality vs Residual Sugar (limit from 1 to 4)

I made two boxplots this time. The second one is restricted to residual.sugar smaller than 3.5.

The majority of wines have residual sugar around 2.

As a non-wine drinker, I find it somewhat surprising that I don’t see any relationship between residual sugar and quality here, as I would personally prefer a slightly sweeter taste.

Chlorides

Histogram for Chlorides

Density for Chlorides

Quality vs Chlorides

As with residual sugar, there are a few outliers with large amounts of chlorides in our sample. I generated a second boxplot with 0.2 as the cutoff point.

There is a weak negative relationship between quality and chlorides.
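The cutoffs used for the second boxplots (3.5 for residual sugar, 0.2 for chlorides) are stated without a derivation. One common convention for flagging outliers, which is not necessarily what was used here, is Tukey's rule: anything above the third quartile plus 1.5 times the interquartile range. The sketch below applies that rule to the quartiles from the basic statistics table. The function name and the multiplier `k` are illustrative.

```python
# (1st quartile, 3rd quartile) copied from the basic statistics table.
quartiles = {
    "residual.sugar": (1.9, 2.6),
    "chlorides": (0.070, 0.090),
}

def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier fence: third quartile plus k times the interquartile range."""
    return q3 + k * (q3 - q1)

fences = {name: tukey_upper_fence(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, fence in fences.items():
    print(f"{name}: values above {fence:.3f} would be flagged as outliers")
```

For residual sugar this convention gives a fence close to the 3.5 used above, while for chlorides it is stricter than the 0.2 cutoff.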

Free Sulfur Dioxide

Histogram for free.sulfur.dioxide

Quality vs free.sulfur.dioxide

Free sulfur dioxide has a long-tailed distribution. There is little obvious relationship between quality and free sulfur dioxide.

Alcohol

Histogram for alcohol (two figures)

Quality vs alcohol

While it is quite clear that high-quality wine (rated 7 or 8) tends to have a higher alcohol content, the picture is less clear for the mid-to-low range. In fact, wine rated 5 has the lowest alcohol content as measured by the quantiles.

The distribution is clearly not normal, and this could also contribute to the poor R^2 for our initial linear model.

Alcohol vs. Volatile.acidity

Multivariate Plots Section

We can clearly see a separation of the high-quality and low-quality wine by looking at the color of the dots. However, there is still a lot of unexplained variance. As shown in the facet grid above, grouping the data by an additional variable does improve the separation, but only marginally.


Final Plots and Summary

Plot One

Correlation Matrix

This visualisation provides a very compact view of the correlations between the variables in the dataset. Both the size and the transparency of each circle encode the magnitude of the correlation, and the colour hue represents its direction. Using this visualisation, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with the quality of wine.

Plot Two

Boxplots of alcohol content for various quality of wine

Given the relatively strong linear correlation that was discovered, I was somewhat surprised by this box plot. The lowest alcohol content appears at a quality rating of 5. It almost seems that the positive linear relationship only applies to mid- to high-quality wine.

Plot Three

Alcohol vs Volatile acidity by wine density and citric acid

Using the divergent color palette, one can see a clear clustering of lower-quality and higher-quality wine. However, there is still unexplained variation within each cluster.


Reflection

I started the data exploration by calculating the correlation matrix. In particular, I was interested in which variables have the strongest linear relationship with wine quality. Among all of the variables, alcohol content and volatile acidity have the strongest linear relationships with wine quality.

Using this information, I fitted a simple linear model to the dataset and found that those two variables explain only about 32% of the total variation in quality.

At this point, I suspected that there might be some non-linear relationships between quality and some of the variables, so I plotted a histogram and a boxplot against quality for each variable. Somewhat to my surprise, I didn’t find any obvious, strong relationship. And despite the relatively strong correlation, I found that the relationship between alcohol and quality is not that linear.

Finally, I made some scatterplots with quality encoded using a divergent colour palette. There is a clear separation between high-quality (rating above 5) and low-quality (rating below 5) wine, but also a lot of noise.

After the exploration, I think it is quite likely that there simply isn’t enough relevant data for accurate prediction. The qualities are measured as the average of ratings by at least three wine experts. One might argue that when this average is taken over a large number of experts’ ratings, it forms a somewhat objective measurement, thanks to the Central Limit Theorem. However, when only a small number of experts’ ratings go into the wine quality, the rating becomes very subjective and heavily influenced by the preferences of the individual judges. This is especially true in our case because the ratings are not even derived from the same group of experts for all the wines. One expert might systematically rate wines lower than the other experts, or vice versa.
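The effect of panel size on the stability of an averaged rating can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: every judge is taken to rate a wine as a hypothetical true quality plus independent Gaussian noise, and the true quality of 6.0 and judge standard deviation of 1.0 are made-up values, not estimates from the dataset.

```python
import random
import statistics

def spread_of_panel_means(panel_size, n_panels=2000, true_quality=6.0,
                          judge_sd=1.0, seed=0):
    """Standard deviation of the average rating across many simulated panels.

    Assumption: each judge's rating is true_quality plus independent
    Gaussian noise with standard deviation judge_sd (hypothetical values).
    """
    rng = random.Random(seed)
    panel_means = []
    for _ in range(n_panels):
        ratings = [rng.gauss(true_quality, judge_sd) for _ in range(panel_size)]
        panel_means.append(statistics.fmean(ratings))
    return statistics.stdev(panel_means)

for size in (3, 10, 30):
    print(f"panel of {size:2d} judges: spread of average rating = "
          f"{spread_of_panel_means(size):.3f}")
```

Under these assumptions the spread shrinks roughly in proportion to one over the square root of the panel size, so an average over three judges is considerably noisier than one over thirty. The simulation does not model a judge who is systematically harsher than the others, which would add a bias that averaging within a small panel cannot remove.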

A much better prediction might be possible if more granular details were available in the dataset, for example each expert’s rating of each wine instead of just a simple average.